Useful tools for easy exploratory data analysis (EDA)
Off-the-shelf, simple functions for data analysis
- pyviz
import pandas as pd

def resumetable(df):
    # Summarizes a dataframe: dtypes, missing counts, cardinality, sample values.
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.iloc[0].values
    summary['Second Value'] = df.iloc[1].values
    return summary
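As a quick sanity check, here is the helper applied to a toy frame (the function is repeated so the snippet runs standalone); the toy data is made up for illustration:

```python
import pandas as pd

def resumetable(df):
    # Same summary helper as above, repeated so this snippet is self-contained.
    print(f"Dataset Shape: {df.shape}")
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.iloc[0].values
    summary['Second Value'] = df.iloc[1].values
    return summary

# Toy frame: column 'b' has one missing value, 'a' has two distinct values.
toy = pd.DataFrame({'a': [1, 2, 2], 'b': ['x', None, 'y']})
print(resumetable(toy))
```

The Missing column should report 0 for `a` and 1 for `b`, and Uniques should be 2 for both.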
Let's test it out with some data from the Melanoma Kaggle contest. This is metadata associated with the images of potential melanomas.
from google.colab import drive
mnt=drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'melanoma_328'
csv_path = base_dir + '/combined_meta_df_reduced.csv'
!pwd
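A stale Drive mount or a typo in the path is the usual failure here, so it can help to sanity-check the CSV location before reading it. A minimal sketch with pathlib, reusing the paths from this post:

```python
from pathlib import Path

# Paths mirror the mount cell above; adjust to your own Drive layout.
root_dir = Path('/content/gdrive/My Drive')
csv_path = root_dir / 'melanoma_328' / 'combined_meta_df_reduced.csv'
print(csv_path)
if not csv_path.exists():
    print('CSV not found -- is Drive mounted and the path correct?')
```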
Typically, the first thing to do is examine the first few rows of data.
df = pd.read_csv(csv_path) # pydicom metadata train baseline
df.head()
I found resumetable to be very convenient. We get a sense of cardinality from Uniques, and we can easily see where we are missing data. For example, there are 65 missing Patient's_Sex values that will need to be accounted for.
It looks like there are 33,126 unique images but only 23,978 unique study times. Perhaps there are really only 23,978 unique images and the other 9,148 images are modified duplicates.
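One way to follow up on that hunch is to flag rows that share a study time as candidate duplicates. A small sketch on made-up data; `image_name` and `study_time` are assumed column names for illustration, not necessarily those in the melanoma CSV:

```python
import pandas as pd

# Hypothetical metadata frame standing in for the real CSV.
meta = pd.DataFrame({
    'image_name': ['img1', 'img2', 'img3', 'img4'],
    'study_time': ['t1', 't1', 't2', 't3'],
})
n_images = meta['image_name'].nunique()
n_times = meta['study_time'].nunique()
print(f"{n_images} unique images, {n_times} unique study times")

# Rows sharing a study_time are candidate duplicates worth inspecting.
candidates = meta[meta.duplicated('study_time', keep=False)]
print(candidates)
```

With `keep=False`, every row in a duplicated group is returned, so each candidate cluster can be examined side by side.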
resumetable(df)
Another tool I use is pandas-profiling.
import sys
#!"{sys.executable}" -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension
from pathlib import Path
import pandas as pd
from ipywidgets import widgets
# The profiling package
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
csv_path = base_dir + '/combined_meta_df_reduced.csv'
df = pd.read_csv(csv_path)
profile = ProfileReport(df, title="Melanoma Metadata", html={"style": {"full_width": True}}, sort="None")
It takes a couple of minutes to process and display the results.
You get some richer analysis, like correlation plots and distributions of the variables.
profile.to_widgets()
An alternative is Sweetviz.
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_html('sweet_report.html')